ABSTRACT

The detection of differential item functioning (DIF)
has become an important psychometric research topic in recent years.
A number of item response theory (IRT) methods for solving this
problem have been suggested. A common approach is to calculate some
function of the area between item response curves estimated from the
subpopulations of interest. While these methods relay overall item
level DIF information, they do not indicate the location and
magnitude of DIF along the ability continuum. In order to provide
these important details, this paper presents a method for producing
simultaneous confidence bands for the difference between item
response curves. After these bands have been plotted, the size and
regions of DIF are easily identified. Implementation considerations
and illustrative examples are also given. One figure illustrates the
discussion, and an appendix presents elements of information matrices
associated with different parameters. (Contains 21 references.)
(Author/SLD)

Graphical IRT-Based DIF Analyses
For many large and small standardized testing programs, checking for differential 
item functioning (DIF) has become a routine practice. This exercise exposes items that favor 
one subgroup over others due to characteristics that might be extraneous to the attributes 
being tested. With regard to this objective, impact studies (i.e., comparing average 
performance across groups) are insufficient unless average group performances are known to 
be equal, a priori. Preferred DIF methods partial out examinee abilities (or proficiencies) in 
some manner when comparing groups on a single item. 

Two general approaches to this problem are usually taken. The first utilizes observed 
scores in DIF analyses. The most popular of this type is the Mantel-Haenszel (MH) 
procedure (Holland & Thayer, 1988), which is a chi-square type test. Within this procedure, 
so-called focal and reference groups are matched on an observed score, which may or may not
include the item being investigated. Other observed score analyses that have been suggested
include the transformed item difficulty method (Angoff & Ford, 1973), the chi-square
method (Camilli, 1979), and the standardization method (Dorans & Kulick, 1986). 

The second general approach can be referred to as model-based analyses. The 
procedures that fall under this classification utilize examinee ability or true score estimates, 
and typically model the item attributes more extensively than those that use observed scores. 
Examples of these can be found in Cohen, Kim, and Subkoviak (1991); Lord (1977a, 
1977b); and Thissen, Steinberg, and Wainer (1988). Comparisons of IRT with observed 
score methods can be found in Hambleton and Rogers (1989); Ironson and Subkoviak (1979);
Shepard, Camilli and Averill (1981); Shepard, Camilli and Williams (1985); and Subkoviak,
Mack, Ironson and Craig (1984). One common approach is to calculate some function of the
area between item characteristic curves (ICCs) estimated from the subpopulations of interest.
[Formulas for computing the area between certain ICCs have been derived by Raju (1988).]
Unfortunately, this method yields only a single statistic and as such does not present an 
investigator with a clear picture of DIF over all parts of the ability range. 
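
As a hedged illustration of why a single area statistic can hide the shape of DIF, the following Python sketch integrates the signed and unsigned areas between two three-parameter logistic ICCs by the trapezoidal rule. The item parameters, the integration range of -4 to 4, and the function names are assumptions chosen for illustration only; they are not taken from the analyses reported in this paper, and the closed-form expressions of Raju (1988) are not used.

```python
import math

def icc_3pl(theta, a, b, c, D=1.7):
    """3PL item characteristic curve: c + (1 - c) * logistic(D * a * (theta - b))."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def area_between_iccs(focal, reference, lo=-4.0, hi=4.0, steps=2000):
    """Trapezoidal signed and unsigned areas between two ICCs on [lo, hi]."""
    h = (hi - lo) / steps
    signed = 0.0
    unsigned = 0.0
    for k in range(steps + 1):
        theta = lo + k * h
        d = icc_3pl(theta, *focal) - icc_3pl(theta, *reference)
        w = 0.5 if k in (0, steps) else 1.0  # trapezoid end-point weights
        signed += w * d * h
        unsigned += w * abs(d) * h
    return signed, unsigned

# Hypothetical (a, b, c) values: equal difficulty and lower asymptote,
# unequal discrimination, so the two curves cross at theta = 0.
focal = (0.6, 0.0, 0.2)
reference = (1.4, 0.0, 0.2)
signed, unsigned = area_between_iccs(focal, reference)
print(f"signed area = {signed:.4f}, unsigned area = {unsigned:.4f}")
```

For these crossing curves the signed area is essentially zero while the unsigned area is not, and neither number says where along the ability scale the groups differ.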

When entire ICCs have been employed (see, for example, Linn, Levine, Hastings and 
Wardrop, 1981; and Shepard, Camilli and Williams, 1984), only point-wise confidence bands 
have been used to account for sampling error, even though simultaneous confidence bands 
are usually more appropriate. In addition, experiment-wise significance levels have not
been properly controlled in the past, so sound statistical conclusions could not be drawn.
The purpose of this paper is to provide investigators with an IRT-based DIF analysis that 
utilizes simultaneous confidence bands for the difference between ICCs, and controls for the 
experiment-wise significance levels. After residual ICCs and their associated confidence 
bands are plotted, investigators are able to conveniently observe DIF regions and magnitudes. 

The procedure presented here follows directly from the methodology for creating 
simultaneous confidence bands for single ICCs. With this in mind, an approach for 
calculating individual confidence bands will first be briefly discussed. Afterwards, a method 
for deriving confidence bands for the difference between two ICCs will be presented. 
Implementation considerations and examples are then discussed, followed by a summary. 

Individual Simultaneous Confidence Bands
As mentioned in the introduction, the methodology for obtaining simultaneous
confidence bands for the difference between two ICCs follows directly from similar
procedures for creating bands for individual ICCs. Therefore, this basic methodology is presented here
for completeness. Procedures for creating simultaneous confidence bands for logistic models 
whose logit is linear (i.e., the Rasch and two-parameter models) have been developed by 
Hauck (1983). This technique is very straightforward and will not be reproduced here. Note 
that the simultaneous confidence bands introduced by Hauck, and those presented here, 
assume that examinee proficiencies are known. 
Individual Simultaneous Confidence Bands for the 3PL 

The three-parameter logistic (3PL) model does not possess a linear logit, and so Hauck's
general approach cannot be applied. The procedure presented here is similar to the Scheffé
method for regression models [see, for example, Rao (1973), Sec. 4b.2]. The basic
approach and final results are the same as those presented in Lord and Pashley (1988),
although some of the intermediate steps presented here are different. A more general
discussion of this problem can be found in Thissen and Wainer (1990).

A common form of the 3PL, found in Lord (1980) and elsewhere, is given by

$$
\begin{aligned}
P(\theta) &= c + \frac{1 - c}{1 + e^{-Da(\theta - b)}}, \qquad a > 0,\ 0 \le c < 1, \\
          &= c + (1 - c)\,\Psi[Da(\theta - b)],
\end{aligned}
$$

where $P(\theta)$ denotes the probability of correctly answering an item given an examinee ability
level $\theta$; $a$, $b$, and $c$ represent the discrimination, difficulty and lower asymptote item
parameters, respectively; $D$ is a constant, usually set to 1.7 or 1.702; and $\Psi(\cdot)$ is the logistic
function.

The form of the 3PL used in this paper is given by

$$P(\theta) = c + (1 - c)\,\Psi(A\theta + B),$$

where $A\theta + B$ is simply a reformulation of $Da(\theta - b)$. Maximum likelihood estimates
(MLEs) of $A$ and $B$ can be obtained from the MLEs of $a$ and $b$, due to their invariance
property, as follows:

$$\hat{A} = D\hat{a}, \quad \text{and} \quad \hat{B} = -D\hat{a}\hat{b}.$$
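
The equivalence of the two parameterizations can be checked numerically. The following is a minimal Python sketch; the parameter estimates and function names are hypothetical and serve only to illustrate the mapping from (a, b) to the slope-intercept pair (A, B).

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_abc(theta, a, b, c, D=1.7):
    """3PL in the usual (a, b, c) form."""
    return c + (1.0 - c) * logistic(D * a * (theta - b))

def to_slope_intercept(a, b, D=1.7):
    """Map (a, b) to (A, B) with A = D*a and B = -D*a*b."""
    return D * a, -D * a * b

def p_ABc(theta, A, B, c):
    """3PL in the slope-intercept form used in this paper."""
    return c + (1.0 - c) * logistic(A * theta + B)

# Hypothetical estimates, for illustration only
a_hat, b_hat, c_hat = 1.1, 0.4, 0.18
A_hat, B_hat = to_slope_intercept(a_hat, b_hat)

# Largest discrepancy between the two forms over a few ability values
max_gap = max(
    abs(p_abc(t, a_hat, b_hat, c_hat) - p_ABc(t, A_hat, B_hat, c_hat))
    for t in (-3.0, -1.0, 0.0, 0.4, 2.5)
)
print(A_hat, B_hat, max_gap)
```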

To simplify notation, let the vectors $\beta = (A, B, c)'$ and $\hat{\beta} = (\hat{A}, \hat{B}, \hat{c})'$ contain the
unknown item parameters and their corresponding MLEs, respectively, associated with a
single item. We assume that $\hat{\beta}$ was estimated from a sample of known $\theta_i$'s ($i = 1, 2, \ldots, N$)
whose properties do not preclude the usual asymptotic normality assumptions associated with
the distribution of $\hat{\beta}$. In particular, we assume that with sufficiently large $N$, the quadratic
form

$$(\hat{\beta} - \beta)'\,\Sigma^{-1}\,(\hat{\beta} - \beta) \sim \chi^2_{(3)}, \tag{1}$$

where $\Sigma$ is the asymptotic covariance matrix associated with $\hat{\beta}$; $\sim$ reads "is approximately
distributed as"; and $\chi^2_{(3)}$ denotes the central chi-square distribution with three degrees of
freedom.

As is commonly done, the expected sample information matrix, denoted $I$, will be
used as an estimate of $\Sigma^{-1}$. One could also employ the observed sample information matrix,
though this alternative may be less well-behaved under certain circumstances. After
substituting $I$ into (1), we can define a $1 - \alpha$ confidence ellipsoid for $\beta$ by

$$\text{Prob}\left[(\beta - \hat{\beta})'\,I\,(\beta - \hat{\beta}) \le \chi^2_{(3,\alpha)}\right] = 1 - \alpha, \tag{2}$$

where $\chi^2_{(3,\alpha)}$ denotes the upper $\alpha$ percentage point of the $\chi^2_{(3)}$ distribution. Note that since the
quadratic form is only "approximately distributed" as chi-square, the inequality in (2) and all
related inequalities and equalities that follow actually denote only asymptotic
approximations.
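
Membership in the confidence ellipsoid amounts to evaluating a quadratic form. The sketch below, in Python, assumes a hypothetical 3 x 3 information matrix and hypothetical estimates; the constant 7.814727903 is the upper 5% point of the chi-square distribution with three degrees of freedom, hard-coded here to keep the example free of external libraries.

```python
CHI2_3_05 = 7.814727903  # upper 5% point of chi-square with 3 df

def quad_form(d, mat):
    """Return d' * mat * d for a length-3 vector and a 3 x 3 matrix."""
    return sum(d[i] * mat[i][j] * d[j] for i in range(3) for j in range(3))

def in_ellipsoid(beta, beta_hat, info, crit=CHI2_3_05):
    """True if beta lies inside the approximate 1 - alpha confidence ellipsoid."""
    d = [x - y for x, y in zip(beta, beta_hat)]
    return quad_form(d, info) <= crit

# Hypothetical information matrix for (A, B, c) and hypothetical MLEs
info = [[120.0, 40.0, -30.0],
        [40.0, 200.0, -90.0],
        [-30.0, -90.0, 400.0]]
beta_hat = (1.8, -0.5, 0.2)

print(in_ellipsoid((1.9, -0.5, 0.2), beta_hat, info))  # small shift in A
print(in_ellipsoid((2.8, -0.5, 0.2), beta_hat, info))  # large shift in A
```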

Now let the three-dimensional constraint space for $\beta$, defined by (2), be denoted by $S$.
Then for a fixed $\theta$, we can map $S$ into a one-dimensional constraint space for $P(\theta)$, which
can be represented by the interval

$$\{\min[P(\theta) \mid \beta \in S],\ \max[P(\theta) \mid \beta \in S]\}.$$

Then using a repeated sampling argument, these intervals will define a $1 - \alpha$ simultaneous
confidence band for $P(\theta)$, over all $\theta$. What remains to be done is to provide a procedure for
obtaining, for any fixed $\theta$, the endpoints of these intervals.

The task of finding the maxima and minima of $P(\theta)$, for a fixed $\theta$ and given a
constraint space $S$, can be formulated in non-linear programming terms. The general
problem can be stated as

$$\text{optimize} \quad P(A, B, c;\ \theta)$$

$$\text{subject to} \quad (\beta - \hat{\beta})'\,I\,(\beta - \hat{\beta}) \le \chi^2_{(3,\alpha)},$$

where the notation $P(A, B, c;\ \theta)$ is used to emphasize which parameters will be varied in
order to obtain an optimal (i.e., maximum or minimum) value of $P(\theta)$.

In order to simplify this problem, a further reparameterization is useful. In
particular, we let

$$L = A\theta + B.$$

Then

$$P(A, B, c;\ \theta) = P(L, c;\ \theta) = c + (1 - c)\,\Psi(L).$$



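The constraint space can be similarly reparameterized, and at the same time reduced
from an ellipsoid to an ellipse on the $L \times c$ plane, by using the linear transformation/reduction
matrix

$$M = \begin{bmatrix} \theta & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}$$

as follows:

$$(\beta - \hat{\beta})'\,I\,(\beta - \hat{\beta}) \ge (\beta - \hat{\beta})'\,M (M' I^{-1} M)^{-1} M'\,(\beta - \hat{\beta}) = (\delta - \hat{\delta})'\,J\,(\delta - \hat{\delta}),$$

where

$$\delta = M'\beta = (L, c)',$$

$$\hat{\delta} = M'\hat{\beta} = (\hat{L}, \hat{c})', \quad \text{and}$$

$$J = (M' I^{-1} M)^{-1}.$$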
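
The reduction from the 3 x 3 matrix for (A, B, c) to the 2 x 2 matrix J for (L, c) can be sketched numerically as follows. This Python example inverts the information matrix directly rather than using closed-form expressions; the information matrix and the ability value are hypothetical, and the helper names are assumptions made for illustration.

```python
def inv3(m):
    """Inverse of a 3 x 3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]

def reduced_information(info, theta):
    """J = (M' I^{-1} M)^{-1} with M = [[theta, 0], [1, 0], [0, 1]]."""
    V = inv3(info)  # approximate covariance matrix of (A, B, c)
    # M' V M is the approximate covariance matrix of (L, c), L = A*theta + B
    v_LL = theta * theta * V[0][0] + 2.0 * theta * V[0][1] + V[1][1]
    v_Lc = theta * V[0][2] + V[1][2]
    v_cc = V[2][2]
    det = v_LL * v_cc - v_Lc * v_Lc
    return [[v_cc / det, -v_Lc / det],
            [-v_Lc / det, v_LL / det]]

# Hypothetical information matrix for (A, B, c)
info = [[120.0, 40.0, -30.0],
        [40.0, 200.0, -90.0],
        [-30.0, -90.0, 400.0]]
J = reduced_information(info, theta=0.5)
print(J)
```

Because M depends on the ability value, J must be recomputed at every point along the ability scale at which the band is evaluated.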

In addition, note that the first partial derivatives of the objective function,

$$\frac{\partial P(L, c;\ \theta)}{\partial L} = \frac{1 - c}{(1 + e^{L})(1 + e^{-L})} \quad \text{and} \quad \frac{\partial P(L, c;\ \theta)}{\partial c} = \frac{1}{1 + e^{L}},$$

are always positive. Hence, the optimal values of $P(L, c;\ \theta)$, for a fixed $\theta$, will be found on
the boundary of the constraint ellipse. The reparameterized and simplified optimization
problem is then

$$\text{optimize} \quad P(L, c;\ \theta)$$

$$\text{subject to} \quad (\delta - \hat{\delta})'\,J\,(\delta - \hat{\delta}) = \chi^2_{(3,\alpha)}.$$
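
The two partial derivatives can be verified against central finite differences. The following Python sketch uses arbitrary illustrative values of L and c; the function names and the step size are assumptions.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def P(L, c):
    """Objective function c + (1 - c) * logistic(L)."""
    return c + (1.0 - c) * logistic(L)

def dP_dL(L, c):
    return (1.0 - c) / ((1.0 + math.exp(L)) * (1.0 + math.exp(-L)))

def dP_dc(L, c):
    return 1.0 / (1.0 + math.exp(L))

def central_diff(f, x, h=1e-6):
    """Two-sided finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

c = 0.2
for L in (-3.0, 0.0, 2.0):
    num_L = central_diff(lambda x: P(x, c), L)
    num_c = central_diff(lambda x: P(L, x), c)
    print(L, dP_dL(L, c), num_L, dP_dc(L, c), num_c)
```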

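After solving the quadratic constraint equation for $L$ and evaluating appropriate ranges
for $c$, the problem of finding a lower bound can be stated as

$$\text{minimize} \quad c + (1 - c)\,\Psi(L)$$

$$\text{subject to} \quad L = \hat{L} + \frac{-(c - \hat{c})\,J_{Lc} - \sqrt{J_{LL}\,\chi^2_{(3,\alpha)} - (c - \hat{c})^2\,(J_{LL}J_{cc} - J_{Lc}^2)}}{J_{LL}},$$

$$\hat{c} - \sqrt{\frac{J_{LL}\,\chi^2_{(3,\alpha)}}{J_{LL}J_{cc} - J_{Lc}^2}} \le c \le \hat{c} + \sqrt{\frac{J_{LL}\,\chi^2_{(3,\alpha)}}{J_{LL}J_{cc} - J_{Lc}^2}}.$$

Similarly, the problem of determining the upper bound can be formulated as

$$\text{maximize} \quad c + (1 - c)\,\Psi(L)$$

$$\text{subject to} \quad L = \hat{L} + \frac{-(c - \hat{c})\,J_{Lc} + \sqrt{J_{LL}\,\chi^2_{(3,\alpha)} - (c - \hat{c})^2\,(J_{LL}J_{cc} - J_{Lc}^2)}}{J_{LL}},$$

$$\hat{c} - \sqrt{\frac{J_{LL}\,\chi^2_{(3,\alpha)}}{J_{LL}J_{cc} - J_{Lc}^2}} \le c \le \hat{c} + \sqrt{\frac{J_{LL}\,\chi^2_{(3,\alpha)}}{J_{LL}J_{cc} - J_{Lc}^2}}.$$

Unfortunately, as $P(L, c;\ \theta)$ is neither convex nor concave in $(L, c)$, multiple local maxima
and minima are possible. However, two line searches for the maxima and minima can easily
be conducted by varying $c$ between the extreme values indicated. Formulae for calculating
the $J_{ij}$'s are given in the Appendix.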
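
The two line searches can be sketched as a simple grid search over c. The Python example below is a minimal illustration rather than the implementation used for the results in this paper: the matrix J, the estimates, the grid size, and the truncation of c to the interval from 0 to 1 are all assumptions, and the critical value is hard-coded for a 95% band.

```python
import math

CHI2_3_05 = 7.814727903  # upper 5% point of chi-square with 3 df

def logistic(x):
    # Numerically stable logistic function
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def band_at_theta(L_hat, c_hat, J, crit=CHI2_3_05, steps=400):
    """Grid search over c for the min and max of c + (1 - c)*logistic(L)
    on the ellipse (delta - delta_hat)' J (delta - delta_hat) = crit."""
    J_LL, J_Lc, J_cc = J[0][0], J[0][1], J[1][1]
    det = J_LL * J_cc - J_Lc ** 2
    half_width = math.sqrt(crit * J_LL / det)
    c_lo = max(0.0, c_hat - half_width)  # assumption: keep c within [0, 1]
    c_hi = min(1.0, c_hat + half_width)
    lower, upper = 1.0, 0.0
    for k in range(steps + 1):
        c = c_lo + (c_hi - c_lo) * k / steps
        dc = c - c_hat
        disc = max(0.0, J_LL * crit - dc * dc * det)  # guard against round-off
        root = math.sqrt(disc)
        L_minus = L_hat + (-dc * J_Lc - root) / J_LL  # smaller root, for the minimum
        L_plus = L_hat + (-dc * J_Lc + root) / J_LL   # larger root, for the maximum
        lower = min(lower, c + (1.0 - c) * logistic(L_minus))
        upper = max(upper, c + (1.0 - c) * logistic(L_plus))
    return lower, upper

# Hypothetical J and estimates at one fixed ability value
J = [[170.0, 20.0], [20.0, 380.0]]
lower, upper = band_at_theta(L_hat=0.3, c_hat=0.2, J=J)
print(lower, upper)
```

Repeating this calculation over a grid of ability values, with J recomputed at each value, traces out the band for a single ICC.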

Residual Simultaneous Confidence Bands
We will proceed in a fashion similar to the one taken in the previous section. First
let the vector $\lambda = (A_F, B_F, c_F, A_R, B_R, c_R)'$ contain the item parameters corresponding to
the focal (F) and reference (R) groups (i.e., the two groups of interest), and let $\hat{\lambda}$ contain
their MLEs. Then after making
the usual normality assumptions, the residual optimization problem can be expressed as:

$$\text{optimize} \quad P(A_F, B_F, c_F;\ \theta) - P(A_R, B_R, c_R;\ \theta)$$

$$\text{subject to} \quad (\lambda - \hat{\lambda})'\,\Sigma^{-1}\,(\lambda - \hat{\lambda}) \le \chi^2_{(6,\alpha)}.$$

We now invoke the assumption of local item independence. This implies that the
between-group item parameter covariances are all equal to zero. The above constraint can
then be written as

$$(\beta_F - \hat{\beta}_F)'\,\Sigma_F^{-1}\,(\beta_F - \hat{\beta}_F) + (\beta_R - \hat{\beta}_R)'\,\Sigma_R^{-1}\,(\beta_R - \hat{\beta}_R) \le \chi^2_{(6,\alpha)},$$

where $\beta_F = (A_F, B_F, c_F)'$ and $\beta_R = (A_R, B_R, c_R)'$, and $\Sigma_F$ and $\Sigma_R$ are the
corresponding within-group covariance matrices.

With this insight, the original residual optimization problem can be rewritten as

$$\text{optimize} \quad P(A_F, B_F, c_F;\ \theta) - P(A_R, B_R, c_R;\ \theta)$$

$$\text{subject to} \quad (\beta_F - \hat{\beta}_F)'\,\Sigma_F^{-1}\,(\beta_F - \hat{\beta}_F) \le \gamma\,\chi^2_{(6,\alpha)},$$

$$(\beta_R - \hat{\beta}_R)'\,\Sigma_R^{-1}\,(\beta_R - \hat{\beta}_R) \le (1 - \gamma)\,\chi^2_{(6,\alpha)},$$

$$0 \le \gamma \le 1.$$

Written in this fashion, the residual optimization problem can now be undertaken as a
series of individual simultaneous confidence band problems. This is achieved in practice by
increasing $\gamma$ incrementally, for a fixed $\theta$, and recording the maximum and minimum
differences.
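
One way to organize this search is sketched below in Python. For each value of the splitting constant, individual bands are computed for the focal and reference groups with the correspondingly scaled critical values, and the extreme differences are recorded. The single-ICC grid search, the estimates, the 2 x 2 matrices for (L, c), and the grid sizes are all illustrative assumptions rather than the implementation behind the reported results.

```python
import math

CHI2_6_05 = 12.591587244  # upper 5% point of chi-square with 6 df

def logistic(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def band_at_theta(L_hat, c_hat, J, crit, steps=200):
    """Min and max of c + (1 - c)*logistic(L) on the ellipse of size crit."""
    J_LL, J_Lc, J_cc = J[0][0], J[0][1], J[1][1]
    det = J_LL * J_cc - J_Lc ** 2
    half_width = math.sqrt(crit * J_LL / det)
    c_lo = max(0.0, c_hat - half_width)  # assumption: keep c within [0, 1]
    c_hi = min(1.0, c_hat + half_width)
    lower, upper = 1.0, 0.0
    for k in range(steps + 1):
        c = c_lo + (c_hi - c_lo) * k / steps
        dc = c - c_hat
        root = math.sqrt(max(0.0, J_LL * crit - dc * dc * det))
        lower = min(lower, c + (1.0 - c) * logistic(L_hat + (-dc * J_Lc - root) / J_LL))
        upper = max(upper, c + (1.0 - c) * logistic(L_hat + (-dc * J_Lc + root) / J_LL))
    return lower, upper

def residual_band_at_theta(theta, beta_F, J_F, beta_R, J_R,
                           crit=CHI2_6_05, gamma_steps=50):
    """Band for P_F(theta) - P_R(theta). beta_* = (A_hat, B_hat, c_hat);
    J_* is the 2 x 2 information matrix for (L, c) at this theta."""
    L_F, c_F = beta_F[0] * theta + beta_F[1], beta_F[2]
    L_R, c_R = beta_R[0] * theta + beta_R[1], beta_R[2]
    lo_diff, hi_diff = float("inf"), float("-inf")
    for k in range(gamma_steps + 1):
        gamma = k / gamma_steps
        lo_F, hi_F = band_at_theta(L_F, c_F, J_F, gamma * crit)
        lo_R, hi_R = band_at_theta(L_R, c_R, J_R, (1.0 - gamma) * crit)
        lo_diff = min(lo_diff, lo_F - hi_R)  # smallest possible residual
        hi_diff = max(hi_diff, hi_F - lo_R)  # largest possible residual
    return lo_diff, hi_diff

# Hypothetical inputs: identical estimates in both groups
beta = (1.8, -0.5, 0.2)
J = [[170.0, 20.0], [20.0, 380.0]]
lo, hi = residual_band_at_theta(0.0, beta, J, beta, J)
print(lo, hi)
```

With identical estimates in the two groups, the resulting interval straddles zero, as would be expected for an item showing no DIF at that ability value.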

Implementation Considerations and Examples
Various sampling and calibration procedures have been suggested for IRT-based DIF
analyses in the past. The methodology presented in the previous section should be compatible with
most of these procedures. Only one method, however, was used to produce the examples
that follow. This approach comprises the following four basic steps:

1) Select representative samples from focal and reference groups. 

2) Calibrate all examinees together on a set of non-DIF items. 

3) Calibrate items of interest, separately by group, using ability estimates obtained in 
Step 2. 

4) Calculate and plot residual ICCs and corresponding simultaneous confidence bands.

Note that while matched samples are not required, obtaining representative samples from the
subgroups of interest should, in most cases, be very important.

Examples 

Items from a large-scale Educational Testing Service administered examination were 
investigated. The focal and reference groups of interest were female and male examinees, 
respectively. The samples consisted of approximately 1,250 females and 1,500 males.
All examinees were first calibrated together on 98 operational (assumed to be non-DIF) items 
to obtain proficiencies on a common scale. Holding these values fixed, experimental items 
were then calibrated, separately, for the two groups. The results for three experimental 
items are presented here. These three items had previously been classified, using different 
samples, as "A", "B", and "C" items, based on a MH analysis. These three classifications,
"A", "B" and "C", refer to low, medium and high levels of DIF, respectively.

Residual ICCs and associated simultaneous 95% confidence bands, for each of the
three items, are given in Figure 1, alongside corresponding individual ICCs with labeled
ranges containing evidence of DIF. The results are clearly in agreement with the previous
MH analysis. The simultaneous confidence band for the first ("A") item completely 
encompasses the zero-residual reference line, indicating no evidence of DIF along the entire 
range of proficiencies. The band for the second ("B") item dips slightly below the reference 
line, within the 0.2 to 1.4 range of abilities. The third ("C") item's band drops significantly
below the reference line, between the ability values -1.2 and 1.8. 
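
Once a residual band has been computed on a grid of ability values, the ranges showing evidence of DIF are those where the band excludes zero. The following Python sketch extracts such ranges; the grid and band values are made up for illustration and are not the values plotted in Figure 1, and the function name is an assumption.

```python
def dif_regions(thetas, lower, upper):
    """Return (start, end, sign) for each maximal run of grid points where the
    band excludes zero; sign is -1 if the band lies below zero, +1 if above."""
    regions = []
    start, sign, prev = None, 0, None
    for t, lo, hi in zip(thetas, lower, upper):
        s = -1 if hi < 0 else (1 if lo > 0 else 0)
        if s != sign:
            if sign != 0:
                regions.append((start, prev, sign))  # close the previous run
            start, sign = t, s
        prev = t
    if sign != 0:
        regions.append((start, prev, sign))
    return regions

# Made-up residual band (focal minus reference) on a coarse ability grid
thetas = [-2, -1, 0, 1, 2]
lower = [-0.10, -0.20, -0.30, -0.20, -0.10]
upper = [0.05, -0.01, -0.10, -0.02, 0.03]
print(dif_regions(thetas, lower, upper))
```

In this made-up example the band lies entirely below zero from -1 to 1, so a single region with negative sign is reported.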

Discussion and Conclusions 
Since some subgroups are not always well represented across the entire ability scale, 
matching samples, as required by most observed score analyses, can be a problem. As 
evident in the previous section, the calibration samples used in the graphical IRT-based 
approach need not be matched. In addition, as seen in Figure 1, the location and
magnitude of the DIF along the ability scale are easily appreciated with the proposed
approach.

The three example items given confirmed the results of a previous MH analysis. In 
general, this is exactly what one would hope to find. However, as with certain other IRT 
approaches, the method presented here allows for the possibility of uncovering bi-directional 
DIF, where the corresponding ICCs cross each other. In these cases, the focal group
performs better than the reference group across some parts of the ability continuum, but
worse across other segments. These "plus" and "minus" DIF regions can cancel each other
out within a MH analysis.

The method presented in this paper utilizes simultaneous confidence bands for the 
difference between ICCs. One may ask whether there are occasions where point-wise bands 
would be more appropriate. In cases where only a few points along the ability scale are of 
interest, calculating simultaneous bands could be excessive. For example, certification tests 
may possess one or two cut-scores of special interest. In these cases point-wise confidence 
intervals may actually be preferable. Otherwise, if a large segment of the proficiency scale is of
interest, calculating simultaneous confidence bands would usually be more appropriate.

While this proposed procedure maintains the trappings of a significance test, it should
be regarded as an exploratory data analysis technique, as not all sources of error are
accounted for. In particular, the sampling errors related to the estimation of examinee
abilities are not included in the calculation of the simultaneous confidence bands. In any 
case, the DIF effect size should be viewed as the most important aspect to consider. The 
proposed method simply tempers enthusiasm for seemingly significant effect sizes by clearly 
illustrating associated sampling variation. 
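
Appendix

The elements ($J_{ij}$) of the information matrix $J$ associated with the parameters $L$ and $c$
can be expressed in terms of the elements ($I_{ij}$) from the information matrix associated with
the parameters $A$, $B$ and $c$, as follows:

$$J_{LL} = I_{BB} - \frac{(I_{AB} - \theta I_{BB})^2}{I_{AA} - 2\theta I_{AB} + \theta^2 I_{BB}},$$

$$J_{Lc} = I_{Bc} - \frac{(I_{AB} - \theta I_{BB})(I_{Ac} - \theta I_{Bc})}{I_{AA} - 2\theta I_{AB} + \theta^2 I_{BB}},$$

$$J_{cc} = I_{cc} - \frac{(I_{Ac} - \theta I_{Bc})^2}{I_{AA} - 2\theta I_{AB} + \theta^2 I_{BB}}.$$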



Lord and Pashley (1988) reported the elements of the information matrix associated
with the parameters $A$, $B$, and $c$, in terms of the elements of the usual information matrix for
$a$, $b$, and $c$ (Lord, 1980, p. 191), as follows:

$$I_{AA} = \frac{1}{D^2}\left[I_{aa} - \frac{2b}{a}\,I_{ab} + \frac{b^2}{a^2}\,I_{bb}\right],$$

$$I_{AB} = \frac{1}{D^2 a}\left[\frac{b}{a}\,I_{bb} - I_{ab}\right],$$

$$I_{BB} = \frac{I_{bb}}{D^2 a^2},$$

$$I_{Ac} = \frac{1}{D}\left[I_{ac} - \frac{b}{a}\,I_{bc}\right],$$

$$I_{Bc} = -\frac{I_{bc}}{Da}.$$

The element $I_{cc}$ remains the same under these two parameter definitions.
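
The change of parameters from (a, b, c) to (A, B, c) can also be carried out numerically through the Jacobian of the transformation, using a = A/D and b = -B/A. The Python sketch below is an illustration under that assumption only; the information matrix and item parameter values are hypothetical, and the function name is not taken from any cited source.

```python
def info_ABc_from_abc(info_abc, a, b, D=1.7):
    """Transform the information matrix for (a, b, c) into the one for
    (A, B, c), where A = D*a and B = -D*a*b, via I_new = T' I_old T."""
    # T[k][i] = partial derivative of old parameter k with respect to new parameter i
    T = [[1.0 / D, 0.0, 0.0],                    # a = A / D
         [-b / (D * a), -1.0 / (D * a), 0.0],    # b = -B / A
         [0.0, 0.0, 1.0]]                        # c unchanged
    return [[sum(T[k][i] * info_abc[k][l] * T[l][j]
                 for k in range(3) for l in range(3))
             for j in range(3)]
            for i in range(3)]

# Hypothetical information matrix for (a, b, c) and item parameter values
info_abc = [[50.0, -10.0, 5.0],
            [-10.0, 80.0, -20.0],
            [5.0, -20.0, 300.0]]
info_ABc = info_ABc_from_abc(info_abc, a=1.2, b=0.5)
print(info_ABc)
```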

Figure Caption

Figure 1: Graphical IRT-based DIF results for three items. Each panel contains (1) a
residual plot with an associated 95% simultaneous confidence band, and (2) a graph
illustrating the ICCs pertaining to the focal and reference groups. The regions labeled "DIF"
refer to the ranges of $\theta$ for which there is evidence of DIF.